Abstract

Independent representations have recently attracted significant attention from the
biological vision and cognitive science communities. It has been 1) argued that
properties such as sparseness and independence play a major role in visual perception, and
2) shown that imposing such properties on visual representations yields receptive
fields similar to those found in human vision. We present a study of the impact of
feature independence on the performance of visual recognition architectures. The
contributions of this study are both theoretical and empirical, and support two
main conclusions. The first is that the intrinsic complexity of the recognition problem
(Bayes error) is higher for independent representations. The increase can be significant,
close to 10% in the databases we considered. The second is that criteria commonly used
in independent component analysis are not sufficient to eliminate all the dependencies
that impact recognition. In fact, "independent components" can be less independent
than previous representations, such as principal components or wavelet bases.
1 Introduction

After decades of work in the area of visual recognition (in the multiple guises of object
recognition, texture classification, and image retrieval, among others) there are still
several fundamental questions on the subject which, by and large, remain unanswered. One
of the core components of any recognition architecture is the feature transformation, a
mapping from the space of image pixels to a feature space with better properties for
recognition. While numerous features have been proposed over the years for various
recognition tasks, there has been little progress towards either 1) a universally good
feature set, or 2) a universal and computationally efficient algorithm for the design of
optimal features for any particular task.

In the absence of indisputable universal guidelines for feature design, one good 
source of inspiration has always been the human visual system. Ever since the work of 
Hubel and Wiesel [11], it has been established that 1) visual processing is local, and 2)
different groups in primary visual cortex (i.e. area V1) are tuned for detecting different
types of stimulus (e.g. bars, edges, and so on). This indicates that, at the lowest level,
the architecture of the human visual system can be well approximated by a multi- 
resolution representation localized in space and frequency, and several "biologically 
plausible" models of early vision are based on this principle [20, 15, 2, 10, 21, 3]. All 
these models share a basic common structure consisting of three layers: a space/space- 
frequency decomposition at the bottom, a middle stage introducing a non-linearity, 
and a final stage pooling the responses from several non-linear units. They therefore 
suggest the adoption of a mapping from pixel-based to space/space-frequency repre- 
sentations as a suitable universal feature transformation for recognition. 

A space/space-frequency representation is obtained by convolving the image with a 
collection of elementary filters of reduced spatial support and tuned to different spatial 
frequencies and orientations. Traditionally, the exact shape of the filters was not con- 
sidered very important, as long as they were localized in both space and frequency, and 
several elementary filters have been proposed in the literature, including differences
of Gaussians [15], Gabor functions [18, 10], and differences of offset Gaussians [15],
among others. More recently, this presumption has been challenged by various authors 
on the basis that the shape of the filters determines fundamentally important properties 
of the representation, such as sparseness [9, 17] and independence [1]. 

These claims have been supported by (quite successful) demonstrations that the
enforcement of sparseness or independence constraints on the design of the feature
transformation leads to representations which exhibit remarkable similarity to the receptive
fields of cells found in V1 [17, 1]. However, while the arguments are appealing and
the pictures compelling, there is, to the best of our knowledge, no proof that
sparseness or independence are, indeed, fundamental requirements for visual recognition.
On the contrary, not all evidence supports this conjecture. For example, detailed
statistical analysis of the coefficients of wavelet transforms (an alternative class of sparse
features which exhibit similar receptive fields) has revealed the existence of clear
inter-dependencies [19].

In what concerns the design of practical recognition systems, properties such as 
sparseness or independence are important only insofar as they enable higher-level goals 
such as computational efficiency or small probability of error. Under a Bayesian view 
of perception [14], these two goals are, in fact, closely inter-related: implementation
of minimum probability of error (MPE) decisions requires accurate density estimates, 
which are very difficult to obtain in high-dimensional feature spaces. The advantage 
of an independent representation is to decouple the various dimensions of the space, 
allowing high dimensional estimates to be computed by the simple multiplication of 
scalars. In this sense, independence can be a crucial enabler for accurate recognition 
with reduced complexity. On the other hand, it is known that any feature transformation 
has the potential to increase Bayes error, the ultimate lower-bound on the probability 
of error that any recognition architecture can achieve, for a given feature space. It is 
not clear that independent feature spaces are guaranteed to exhibit lower Bayes error 
than non-independent ones. In fact, since the independence constraint restricts the set 
of admissible transforms, it is natural to expect the opposite. 

Due to all of this, while there seem to be good reasons for the use of indepen- 
dent or sparse representations, it is not clear that they will lead to optimal recognition. 
Furthermore, it is usually very difficult to determine, in practice, if goals such as inde- 
pendence are actually achieved. In fact, because guaranteeing independence is a terri- 
bly difficult endeavor in high-dimensions, independent component analysis techniques 
typically resort to weaker goals, such as minimizing certain cumulants, or searching for 
non-Gaussian solutions. While an independent representation will meet these weaker 
goals, the reverse does not usually hold. In practice, it is in general quite difficult to 
evaluate by how much the true goal of independence has been missed. 

In this work we address two questions regarding the role of independence. The first 
is fundamental in nature: "how important is independence for visual recognition?". 
The second is relevant for the design of recognition systems: "how realistic is the 
expectation of actually enforcing independence constraints in real recognition scenar- 
ios?". To study these questions we built a complete recognition system and compared 
the performance of various feature transforms which claim different degrees of inde- 
pendence: from generic features that make no independence claims (but were known to 
have good recognition performance), to features (resulting from independent compo- 
nent analysis) which are supposed to be independent, passing through transforms that 
only impose very weak forms of independence, such as decorrelation. 

It turns out that, with the help of some simple theoretical results, the analysis of the 
recognition accuracy achieved by the different transforms already provides significant 
support for the following qualitative answers to the questions above. First, it seems to 
be the case that imposing independence constraints increases the intrinsic complexity 
(Bayes error) of the recognition problem. In fact, our data supports the conjecture that
this intrinsic complexity increases monotonically with the degree of independence.
Second, it seems clear that great care needs to be exercised in the selection of the 
independence measures used to guide the design of independent component transfor- 
mations. In particular, our results show that approaches such as minimizing cumulants 
or searching for non-Gaussian solutions are not guaranteed to achieve this goal. In 
fact, they can lead to "independent components" that are less independent than those 
achieved with "decorrelating" representations such as principal component analysis or 
wavelets. 
2 Bounds on recognition accuracy

A significant challenge for empirical evaluation is to provide some sort of guarantees 
that the observed results are generalizable. This challenge is particularly relevant in 
the context of visual recognition, since it is impossible to implement all the recogni- 
tion architectures that could ever be conceived. For example, the fact that we rely on a 
Bayesian classification paradigm should not compromise the applicability of the con- 
clusions to recognition scenarios based on alternative classification frameworks (e.g. 
discriminant techniques such as neural networks or support vector machines). This 
goal can only be met with recourse to theoretical insights on the performance of recog- 
nition systems, which are typically available in the form of bounds on the probability 
of classification error. 

The most relevant of these bounds is that provided by the Bayes error, which is the 
minimum error that any architecture can achieve in a given classification problem. 

Theorem 1 Given a feature space $\mathcal{X}$ and a query $\mathbf{x} \in \mathcal{X}$, the decision function which
minimizes the probability of classification error is the Bayes or maximum a posteriori
(MAP) classifier

$$g^*(\mathbf{x}) = \arg\max_i P_{Y|\mathbf{X}}(i|\mathbf{x}), \tag{1}$$

where $Y$ is a random variable that assigns $\mathbf{x}$ to one of $M$ classes, and $i \in \{1, \ldots, M\}$.
Furthermore, the probability of error is lower bounded by the Bayes error

$$L^* = 1 - E_{\mathbf{x}}\left[\max_i P_{Y|\mathbf{X}}(i|\mathbf{x})\right], \tag{2}$$

where $E_{\mathbf{x}}$ means expectation with respect to $P_{\mathbf{X}}(\mathbf{x})$.

Proof: see [23].
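As a minimal illustration of Theorem 1, the following Python sketch evaluates the MAP decision rule (1) and the Bayes error (2) on a small discrete feature space. The joint probability table `JOINT` and the function names are hypothetical and chosen only for illustration; they are not taken from the experiments reported here.

```python
# Toy check of Theorem 1 on a discrete feature space.
# JOINT is a made-up table, not data from the paper.

def map_classifier(joint):
    """MAP decision g*(x) for each x, given joint[x][i] = P(X=x, Y=i)."""
    # argmax_i P(i|x) equals argmax_i P(x, i) because P(x) does not depend on i.
    return {x: max(range(len(p)), key=lambda i: p[i]) for x, p in joint.items()}

def bayes_error(joint):
    """L* = 1 - E_x[max_i P(i|x)] = 1 - sum_x max_i P(x, i)."""
    return 1.0 - sum(max(p) for p in joint.values())

# joint[x] = (P(X=x, Y=0), P(X=x, Y=1)); all entries sum to one.
JOINT = {"a": (0.30, 0.05), "b": (0.10, 0.15), "c": (0.10, 0.30)}

decisions = map_classifier(JOINT)
l_star = bayes_error(JOINT)
```

No classifier operating on this feature space can have an error probability below `l_star`, whatever its internal structure.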

The significance of this theorem is that any insights on the Bayes error that may be 
derived from observations obtained with a particular recognition architecture are valid 
for all architectures, as long as the feature space X is the same. The following theorem 
shows that a feature transformation can never lead to smaller error in the transformed 
space than that achievable in the domain space. 

Theorem 2 Given a classification problem with observation space $\mathcal{Z}$ and a feature
transformation

$$T: \mathcal{Z} \to \mathcal{X},$$

then

$$L^*_{\mathcal{X}} \geq L^*_{\mathcal{Z}} \tag{3}$$

where $L^*_{\mathcal{Z}}$ and $L^*_{\mathcal{X}}$ are, respectively, the Bayes errors on $\mathcal{Z}$ and $\mathcal{X}$. Furthermore,
equality is achieved if and only if $T$ is an invertible transformation.

Proof: see [23].

The last statement of the theorem is a worst-case result. In fact, for a specific classi- 
fication problem, it may be possible to find non-invertible feature transformations that 
do not increase Bayes error. What is not possible is to find 1) a feature transformation 
that will reduce the Bayes error, or 2) a universal non-invertible feature transformation
guaranteed not to increase the Bayes error on all classification problems. 
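The behavior described above can be checked on a toy discrete problem. The sketch below, which assumes a made-up joint table `JOINT_Z` and hypothetical helper names, pushes the joint distribution through three deterministic maps: a relabeling (invertible), a merge that happens to leave the Bayes error unchanged, and a merge that increases it.

```python
# Toy illustration of Theorem 2: a feature transformation never lowers Bayes error.
# JOINT_Z is a made-up table, not data from the paper.

def bayes_error(joint):
    """L* = 1 - sum_x max_i P(x, i) for joint[x][i] = P(X=x, Y=i)."""
    return 1.0 - sum(max(p) for p in joint.values())

def transform(joint, t):
    """Push the joint P(Z=z, Y=i) through a deterministic map t: z -> x."""
    out = {}
    for z, p in joint.items():
        x = t(z)
        prev = out.get(x, (0.0,) * len(p))
        out[x] = tuple(a + b for a, b in zip(prev, p))
    return out

JOINT_Z = {0: (0.30, 0.05), 1: (0.10, 0.15), 2: (0.10, 0.30)}

l_z = bayes_error(JOINT_Z)
l_relabel = bayes_error(transform(JOINT_Z, lambda z: 2 - z))      # invertible
l_merge_12 = bayes_error(transform(JOINT_Z, lambda z: min(z, 1)))  # merges z=1, z=2
l_merge_01 = bayes_error(transform(JOINT_Z, lambda z: max(z, 1)))  # merges z=0, z=1
```

Merging `z=1` and `z=2` is harmless here because both values already receive the same MAP decision, while merging `z=0` and `z=1` mixes values with different decisions and raises the Bayes error.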

Since Bayes error is an intrinsic measure of the complexity of a classification prob- 
lem, the theorems above are applicable to any classification architecture. The following 
upper bounds are specific to a family of architectures that we will consider throughout 
this work, and are usually referred to as plug-in decision rules [8]. The basic idea is to 
rely on Bayes rule to invert (1) 

$$g^*(\mathbf{x}) = \arg\max_i P_{\mathbf{X}|Y}(\mathbf{x}|i) P_Y(i), \tag{4}$$

and then estimate the quantities $P_{\mathbf{X}|Y}(\mathbf{x}|i)$ and $P_Y(i)$ from training images. This leads
to the following upper bound on the probability of error.

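Theorem 3 Given a classification problem with a feature space $\mathcal{X}$, unknown class
probabilities $P_Y(i)$ and class conditional likelihood functions $P_{\mathbf{X}|Y}(\mathbf{x}|i)$, and a
decision function

$$g(\mathbf{x}) = \arg\max_i \hat{p}_{\mathbf{X}|Y}(\mathbf{x}|i)\, \hat{p}_Y(i), \tag{5}$$

the difference between the actual and Bayes error is upper bounded by

$$P(g(\mathbf{X}) \neq Y) - L^*_{\mathcal{X}} \leq \sum_i \int \left| P_{\mathbf{X}|Y}(\mathbf{x}|i) P_Y(i) - \hat{p}_{\mathbf{X}|Y}(\mathbf{x}|i)\, \hat{p}_Y(i) \right| d\mathbf{x}. \tag{6}$$

Proof: see [23].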

In the remainder of this work we assume that the classes are a priori equiprobable,
i.e. $P_Y(i) = 1/M, \forall i$. This leads to the following corollary.

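Corollary 1 Given a classification problem with equiprobable classes, a feature space
$\mathcal{X}$, unknown class conditional likelihood functions $P_{\mathbf{X}|Y}(\mathbf{x}|i)$, and a decision function

$$g(\mathbf{x}) = \arg\max_i \hat{p}_{\mathbf{X}|Y}(\mathbf{x}|i), \tag{7}$$

the difference between the actual and Bayes error is upper bounded by

$$P(g(\mathbf{X}) \neq Y) - L^*_{\mathcal{X}} \leq \Delta_{g,\mathcal{X}} \tag{8}$$

where

$$\Delta_{g,\mathcal{X}} = \sum_i KL\left[P_{\mathbf{X}|Y}(\mathbf{x}|i) \,\|\, \hat{p}_{\mathbf{X}|Y}(\mathbf{x}|i)\right] \tag{9}$$

is the estimation error and

$$KL\left[P_{\mathbf{X}}(\mathbf{x}) \,\|\, Q_{\mathbf{X}}(\mathbf{x})\right] = \int P_{\mathbf{X}}(\mathbf{x}) \log \frac{P_{\mathbf{X}}(\mathbf{x})}{Q_{\mathbf{X}}(\mathbf{x})}\, d\mathbf{x} \tag{10}$$

is the relative entropy, or Kullback-Leibler divergence, between $P_{\mathbf{X}}(\mathbf{x})$ and $Q_{\mathbf{X}}(\mathbf{x})$.

Proof: see [23].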

Bounds (3) and (8) reflect the impact of both feature selection and density esti- 
mation on recognition accuracy. While the feature transformation determines the best 
possible achievable performance, the quality of the density estimates determines how 
close the actual error is to this lower bound. Hence, for problems where density esti- 
mation is accurate one expects the actual error to be close to the Bayes error. On the 
other hand, when density estimates are poor, there are no guarantees that this will be 
the case. 
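To make the role of the estimation error concrete, the sketch below computes the sum of Kullback-Leibler divergences between true and estimated class-conditional likelihoods, as in (9) and (10), for a discrete feature with three values. The distributions `TRUE`, `GOOD` and `POOR` are invented for illustration, and the function names are hypothetical.

```python
import math

def kl(p, q):
    """KL[p || q] = sum_x p(x) log(p(x) / q(x)) for discrete distributions (nats)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def estimation_error(true_likelihoods, estimated_likelihoods):
    """Sum over classes of KL[P(x|i) || p_hat(x|i)], as in (9)."""
    return sum(kl(p, q) for p, q in zip(true_likelihoods, estimated_likelihoods))

# One row per class; each row is a distribution over three feature values.
TRUE = [(0.6, 0.3, 0.1), (0.1, 0.3, 0.6)]
GOOD = [(0.58, 0.31, 0.11), (0.12, 0.29, 0.59)]   # accurate density estimates
POOR = [(0.4, 0.3, 0.3), (0.3, 0.4, 0.3)]         # inaccurate density estimates

delta_exact = estimation_error(TRUE, TRUE)
delta_good = estimation_error(TRUE, GOOD)
delta_poor = estimation_error(TRUE, POOR)
```

With accurate estimates the bound is tight and the actual error stays near the Bayes error; with poor estimates the bound becomes loose and offers little guarantee.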

The latter tends to be the case for visual recognition, where high-dimensional fea- 
ture spaces usually make density estimation a difficult problem. It is, therefore, diffi- 
cult to determine if the error is mostly due to the intrinsic complexity of the problem 
(Bayes error) or to poor quality of density estimates. One of the contributions of this 
work is a strategy to circumvent this problem, based on the notion of embedded feature 
spaces [24]. 

Definition 1 Given two vector spaces $\mathcal{X}_m$ and $\mathcal{X}_n$, $m < n$, such that $\dim(\mathcal{X}_m) = m$
and $\dim(\mathcal{X}_n) = n$, an embedding is a mapping

$$\epsilon: \mathcal{X}_m \to \mathcal{X}_n \tag{11}$$

which is one-to-one.

A canonical example of embedding is the zero padding operator for Euclidean spaces

$$\iota_m^n: \mathbb{R}^m \to \mathbb{R}^n \tag{12}$$

where $\iota_m^n(\mathbf{x}) = (\mathbf{x}, \mathbf{0})$, $\mathbf{x} \in \mathbb{R}^m$, and $\mathbf{0} \in \mathbb{R}^{n-m}$.

Definition 2 A sequence of vector spaces $\{\mathcal{X}_1, \ldots, \mathcal{X}_d\}$, such that $\dim(\mathcal{X}_i) < \dim(\mathcal{X}_{i+1})$,
is called embedded if there exists a sequence of embeddings

$$\epsilon_i: \mathcal{X}_i \to \mathcal{X}'_{i+1}, \quad i = 1, \ldots, d-1, \tag{13}$$

such that $\mathcal{X}'_{i+1} \subset \mathcal{X}_{i+1}$.

The inverse operation of an embedding is a submersion.

Definition 3 Given two vector spaces $\mathcal{X}_m$ and $\mathcal{X}_n$, $m < n$, such that $\dim(\mathcal{X}_m) = m$
and $\dim(\mathcal{X}_n) = n$, a submersion is a mapping

$$\gamma: \mathcal{X}_n \to \mathcal{X}_m \tag{14}$$

which is surjective.

A canonical example of submersion is the projection of Euclidean spaces along the
coordinate axes

$$\pi_m^n: \mathbb{R}^n \to \mathbb{R}^m \tag{15}$$

where $\pi_m^n(x_1, \ldots, x_m, x_{m+1}, \ldots, x_n) = (x_1, \ldots, x_m)$. The following theorem shows
that any linear feature transformation originates a sequence of embedded vector spaces
with monotonically decreasing Bayes error, and monotonically increasing estimation
error.
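Theorem 4 Let

$$T: \mathbb{R}^d \to \mathcal{X} \subset \mathbb{R}^d$$

be a linear feature transformation. Then,

$$\mathcal{X}_i = \pi_i^d(\mathcal{X}), \quad i = 1, \ldots, d-1 \tag{16}$$

is a sequence of embedded feature spaces such that

$$L^*_{\mathcal{X}_{i+1}} \leq L^*_{\mathcal{X}_i}. \tag{17}$$

Furthermore, if $\mathbf{X}_1^d = \{\mathbf{X}_1, \ldots, \mathbf{X}_d\}$ is a sequence of random variables such that

$$\mathbf{X}_i = \pi_i^d(\mathbf{X}), \quad i = 1, \ldots, d \tag{18}$$

and $\{g_i(\mathbf{x})\}_1^d$ a sequence of decision functions

$$g_i(\mathbf{x}) = \arg\max_k \hat{p}_{\mathbf{X}_i|Y}(\mathbf{x}|k) \tag{19}$$

then

$$\Delta_{g_{i+1},\mathcal{X}_{i+1}} \geq \Delta_{g_i,\mathcal{X}_i}. \tag{20}$$

Proof: see Appendix A.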
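The monotonic decrease of the Bayes error along a sequence of embedded subspaces can be illustrated in closed form. The sketch below assumes a special case that is not part of the experiments in this work: two equiprobable Gaussian classes with identity covariance and a hypothetical mean difference `DELTA`, for which the Bayes error is $\Phi(-\|\delta\|/2)$. The helpers `project` and `zero_pad` implement the canonical submersion and embedding of Euclidean spaces.

```python
import math

def project(x, m):
    """Canonical submersion: keep the first m coordinates of x."""
    return tuple(x[:m])

def zero_pad(x, n):
    """Canonical embedding: append n - len(x) zeros to x."""
    return tuple(x) + (0.0,) * (n - len(x))

def gaussian_bayes_error(delta):
    """Bayes error for equiprobable classes N(0, I) and N(delta, I): Phi(-|delta|/2)."""
    half_dist = 0.5 * math.sqrt(sum(d * d for d in delta))
    return 0.5 * math.erfc(half_dist / math.sqrt(2.0))

DELTA = (1.0, 0.0, 0.5, 2.0)   # hypothetical mean difference between the classes

# Bayes error in the subspaces spanned by the first 1, 2, 3, 4 coordinates.
errors = [gaussian_bayes_error(project(DELTA, i)) for i in range(1, len(DELTA) + 1)]
```

The second coordinate carries no class information in this example, so the Bayes error stays constant when it is added; every other coordinate lowers it.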

Figure 1 illustrates the evolution of the upper and lower bounds on the probability 
of error as one considers successively higher-dimensional subspaces of X. Since accu- 
rate density estimates can usually be obtained in low-dimensions, the two bounds tend 
to be close when the subspace dimension is small. In this case, the actual probability 
of error is dominated by the Bayes error. For higher-dimensional subspaces two distinct
scenarios are possible, depending on the independence of the individual random
variables $X_i$. Whenever these variables are dependent, the decrease in Bayes error
tends to be cancelled by an increase in estimation error and the actual probability of 
error increases. In this case, the actual probability of error exhibits the concave shape 
depicted in the left plot, where an inflection point marks the subspace dimension for 
which Bayes error ceases to be dominant. 

The right plot depicts the situation where the variables Xi are independent. In this 
case, it can be shown that (see proof of Theorem 4) 

$$\Delta_{g_{i+1},\mathcal{X}_{i+1}} - \Delta_{g_i,\mathcal{X}_i} = \sum_k KL\left[P_{X_{i+1}|Y}(x|k) \,\|\, \hat{p}_{X_{i+1}|Y}(x|k)\right], \tag{21}$$

i.e. the increase in overall estimation error is simply the sum of the errors of the in- 
dividual scalar estimates. Since these errors tend to be small, one expects the overall 
probability of error to remain approximately flat. 

Hence, the shape of the curve of probability of error as a function of the subspace
dimension carries significant information about 1) the Bayes error in the full space $\mathcal{X}$
and 2) the independence of the component random variables $X_i$. We will see in
section 5 that this information is sufficient to draw, with reasonable certainty, conclusions
such as "the Bayes error of transform T is greater than that of transform U". With
regard to independence, the ultimate test is, of course, to implement a recognition
system based on estimates of the joint density $P_{\mathbf{X}}(\mathbf{x})$ and compare it with a recognition
system based on the independence assumption, i.e. $P_{\mathbf{X}}(\mathbf{x}) = \prod_i P_{X_i}(x_i)$. When
independence holds, the two systems will achieve the same recognition rates. From now
on, we will refer to the former system as based on joint modeling and to the latter as
based on independent modeling or on the product of marginals.

Figure 1: Upper bound, lower bound, and actual probability of error as a function of
subspace dimension. Left: dependent features. Right: independent features.
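The gap between joint modeling and the product of marginals has a closed form when the features are jointly Gaussian, which gives a quick feel for how dependence translates into modeling error. The sketch below relies on the standard identity that the divergence between a zero-mean Gaussian and the product of its marginals equals $-\frac{1}{2}\log\det R$, with $R$ the correlation matrix. The Gaussian assumption and the function name are illustrative only; the experiments in this work use Gaussian mixtures.

```python
import numpy as np

def kl_joint_vs_marginals(cov):
    """KL[N(0, cov) || prod_i N(0, cov_ii)] = -0.5 * log det(R), R = correlation matrix."""
    cov = np.asarray(cov, dtype=float)
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    _, logdet = np.linalg.slogdet(corr)
    return float(-0.5 * logdet)

kl_independent = kl_joint_vs_marginals([[2.0, 0.0], [0.0, 0.5]])
kl_weak = kl_joint_vs_marginals([[1.0, 0.5], [0.5, 1.0]])
kl_strong = kl_joint_vs_marginals([[1.0, 0.9], [0.9, 1.0]])
```

The divergence vanishes only for uncorrelated Gaussian features and grows with the correlation, so the two recognition systems coincide exactly when independence holds.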

3 Feature transforms 

Since the goal is to evaluate the impact of independence on visual recognition, it is 
natural to study transformations that lead to features with different degrees of indepen- 
dence. We restrict our attention to the set of transformations that perform some sort of 
space/space-frequency decomposition. In this context, the feature transformation is a 
mapping 

$$T: \mathbb{R}^k \to \mathbb{R}^d, \qquad \mathbf{z} \mapsto \mathbf{x} = W\mathbf{z}$$

where $\mathbf{z}$ is an $n \times n$ image patch with columns stacked into a $k$-dimensional vector
($k = n^2$) and $W$ the transformation matrix. In general, $k \geq d$, and one can also define
a reconstruction mapping

$$R: \mathbb{R}^d \to \mathbb{R}^k, \qquad \mathbf{x} \mapsto \mathbf{z} = A\mathbf{x}$$

from features $\mathbf{x}$ to pixels $\mathbf{z}$. The columns of $A$ are called basis functions of the
transformation. When $d = k$ and $A = W^T$ the transformation is orthogonal. Various popular
space/space-frequency representations are derived from orthogonal feature transforms.
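The mappings above can be sketched in a few lines of NumPy. The example extracts overlapping patches, stacks their columns into vectors, applies an orthogonal $W$, and reconstructs the pixels with $A = W^T$. The random image, the random orthogonal matrix obtained from a QR factorization, and the helper name `extract_patches` are all assumptions made for illustration; none of them corresponds to a transform studied in this work.

```python
import numpy as np

def extract_patches(image, n, step):
    """Slide an n x n window over a 2-D array and stack each patch column-wise
    into a vector of length k = n * n (one patch per output row)."""
    rows, cols = image.shape
    patches = [
        image[r:r + n, c:c + n].flatten(order="F")   # column stacking
        for r in range(0, rows - n + 1, step)
        for c in range(0, cols - n + 1, step)
    ]
    return np.array(patches)

rng = np.random.default_rng(0)
image = rng.random((16, 16))                          # stand-in for a real image
Z = extract_patches(image, n=8, step=2)               # one row per patch z

# Illustrative orthogonal transform: Q factor of a random matrix.
W, _ = np.linalg.qr(rng.standard_normal((64, 64)))

X = Z @ W.T        # x = W z for every patch (row-vector form)
Z_rec = X @ W      # z = A x with A = W^T (row-vector form)
```

Because $W$ is orthogonal, the reconstruction is exact; for $d < k$ only the component of each patch inside the span of the retained basis functions would be recovered.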
Definition 4 The Discrete Cosine Transform (DCT) [13] of size $n$ is the orthogonal
transform whose basis functions are defined by:

$$A(i,j) = \alpha_i \alpha_j \cos\frac{(2x+1)i\pi}{2n} \cos\frac{(2y+1)j\pi}{2n}, \quad 0 \leq i,j,x,y < n \tag{22}$$

where $\alpha_i = \sqrt{1/n}$ for $i = 0$, and $\alpha_i = \sqrt{2/n}$ otherwise.
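A minimal NumPy sketch of the DCT basis is given below, assuming the standard orthonormal DCT-II form with normalization $\sqrt{1/n}$ for the constant function and $\sqrt{2/n}$ otherwise. The function names are hypothetical. The sketch also checks numerically that the resulting basis is orthonormal, which is the property that makes $A = W^T$ a valid reconstruction mapping.

```python
import numpy as np

def dct_matrix(n):
    """1-D DCT-II matrix C with C[i, x] = a_i * cos((2x + 1) * i * pi / (2n))."""
    x = np.arange(n)
    i = x.reshape(-1, 1)
    C = np.cos((2 * x + 1) * i * np.pi / (2 * n))
    C[0, :] *= np.sqrt(1.0 / n)    # normalization for i = 0
    C[1:, :] *= np.sqrt(2.0 / n)   # normalization for i > 0
    return C

def dct_basis(n):
    """2-D separable basis: basis[i, j] is the n x n basis function A(i, j)."""
    C = dct_matrix(n)
    return np.einsum("ix,jy->ijxy", C, C)

C8 = dct_matrix(8)
B8 = dct_basis(8)
M = B8.reshape(64, 64)   # one flattened basis function per row
```

Each row of `M` is one 8 x 8 basis function; `B8[0, 0]` is the constant (DC) function and higher indices correspond to higher horizontal and vertical frequencies.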

The DCT has empirically been shown to have good decorrelation properties [13] and, 
in this sense, DCT features are at the bottom of the independence spectrum. Previous 
recognition results had shown, however, that it can lead to recognition rates comparable 
to or better than those of many features proposed in the recognition literature [22]. It 
is possible to show that, for certain classes of stochastic processes, the DCT converges 
asymptotically to the following transform [13]. 

Definition 5 Principal Components Analysis (PCA) is the orthogonal transform defined by

$$W = D^{-1/2} E^T, \tag{23}$$

where $E D E^T$ is the eigenvector decomposition of the covariance matrix $E[\mathbf{z}\mathbf{z}^T]$.

It is well known (and straightforward to show) that PCA generates uncorrelated
features, i.e. $E[\mathbf{x}\mathbf{x}^T] = I$. While they yield spatial/spatial-frequency representations,
the major limitation of the above transforms as models for visual perception is the
arbitrary nature of their spatial localization (enforced by arbitrarily segmenting images
into blocks). This can result in severe scaling mismatches if the block size does not
match that of the image detail. Such scaling problems are alleviated by the wavelet
representation.
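The decorrelation property of PCA stated above can be verified numerically. The sketch below builds $W = D^{-1/2} E^T$ from the sample second-moment matrix of synthetic zero-mean data and checks that the transformed features have identity second moment. The synthetic data, the mixing matrix, and the function name `pca_transform` are assumptions made for illustration.

```python
import numpy as np

def pca_transform(Z):
    """Return W = D^{-1/2} E^T, where E D E^T is the eigendecomposition of the
    sample second-moment matrix of the rows of Z (assumed zero-mean)."""
    cov = Z.T @ Z / Z.shape[0]
    eigvals, E = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
    eigvals, E = eigvals[order], E[:, order]
    return np.diag(eigvals ** -0.5) @ E.T

rng = np.random.default_rng(0)
mixing = rng.standard_normal((4, 4))           # hypothetical correlating matrix
Z = rng.standard_normal((5000, 4)) @ mixing.T  # correlated synthetic observations
Z -= Z.mean(axis=0)

W = pca_transform(Z)
X = Z @ W.T                                    # x = W z for every observation
second_moment = X.T @ X / X.shape[0]
```

Decorrelation removes only second-order dependence; for non-Gaussian inputs the features `X` can still be strongly dependent.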

Definition 6 A wavelet transform (WT) [16] is the orthogonal transform whose basis 
functions are defined by 

A(i,j) = v^* (2* X - i) * (2>y - j) SS^,,^ (24) 

where $\psi$ is a function (wavelet) that integrates to zero.

Like the DCT, wavelets have been shown empirically to achieve good decorrelation. 
While this is an important part of independence (all of it when the inputs are Gaussian) 
there is in general a significant amount of higher-order dependencies that cannot be 
captured by orthogonal components [17]. Eliminating such dependencies is the goal of 
independent component analysis. 

Definition 7 Independent Component Analysis (ICA) [4] is a feature transform such
that

$$P_{\mathbf{X}}(\mathbf{x}) = \prod_i P_{X_i}(x_i) \tag{25}$$

where $\mathbf{X} = \{X_1, \ldots, X_d\}$ is the random process from which feature vectors are drawn.
An equivalent definition is to require that the mutual information between features is
zero (see [1] for details). The exact details of ICA depend on the particular algorithm 
used to learn the basis from a training sample. Since independence is usually difficult to 
measure and enforce if d is large, ICA techniques tend to settle for less ambitious goals. 
The most popular solution is to minimize a contrast function which is guaranteed to be 
zero if the inputs are independent. Examples of such contrast functions are higher order 
correlations and information-theoretic objective functions [4]. In this work, we consider 
representatives from the two types: the method developed by Comon [5], which uses a 
contrast function based on high-order cumulants, and the FastICA algorithm [12], that 
relies on the negative entropy of the features. 

4 Experimental set-up 

In order to evaluate the recognition accuracy achievable with the various feature trans- 
formations, we conducted experiments on two image databases: the Brodatz texture 
database, and the Corel database of stock photography. Brodatz is a standard bench- 
mark for texture classification under controlled imaging conditions, and no distractors. 
Corel is a good testing ground for recognition in the context of natural scenes (e.g. no 
control over lighting or object pose, cluttered backgrounds). 

Brodatz contains 112 gray-scale textures, each of which was broken down into nine
128 x 128 patches, leading to a total of 1008 images. This set was split into two subgroups, a
query database containing the first patch of each texture and a retrieval database
containing the remaining 8. In the case of Corel, we selected 15 image classes, each
containing 100 color images. We then created query and retrieval databases by assigning
each image to the query set with probability 0.2.

All color images were converted to the YBR color space. Where applicable, the 
feature transformations were applied to each channel separately and the resulting fea- 
ture vectors combined by interleaving the color components according to the pattern 
YBRYBR .... For each channel, the feature space was 64-dimensional (three layers 
of wavelet decomposition and 8x8 image blocks in the remaining cases) and consec- 
utive observations were extracted with a step of 2 (Brodatz) or 4 (Corel) pixels in each 
of the x and y directions. Public domain software by the authors of the techniques was
used for learning the feature transformations. All learning was based on two
100,000-point samples extracted randomly from the retrieval databases. Figure 2 presents the
basis functions learned from Brodatz for PCA, ICA with the method by P. Comon, and
ICA with the FastICA algorithm, as well as the DCT basis (wavelet bases do not have
block-based support and are not shown).

Once the different bases were computed, all image patches were projected onto
each of them, leading to a sample of feature vectors per image. Maximum likelihood
(ML) parameters of a Gaussian mixture model were then estimated using the EM
algorithm [7]. The number of Gaussian components was held constant (several values
were tried with qualitatively similar results; here we report results with 8 components),
and a joint density for each of the embedded subspaces $\mathcal{X}_i$ was obtained by downward
projection of the joint density in $\mathcal{X}$ [24]. A Gaussian mixture with the same number
of components was also fit to each of the scalar variables $X_i$ to obtain the independent
model.

Figure 2: Basis functions for DCT (top left), PCA (top right), ICA learned with
Comon's method (bottom left), and ICA learned with the FastICA method (bottom
right).

To double-check the independence results we computed various statistical
measures of independence. The first was the KL divergence between the joint and
independent models, $KL\left[P_{\mathbf{X}}(\mathbf{x}) \,\|\, \prod_i P_{X_i}(x_i)\right]$. Since we wanted an alternative measure of
independence not affected by the quality of the mixture parameter estimates, we used
histograms to compute this statistic. However, in order to avoid well known problems
of histogram-based estimates in high dimensions, we only considered average pairwise
divergences

$$KL(X_i) = \frac{1}{d-1} \sum_{j \neq i} KL\left[P_{X_i,X_j}(x_i, x_j) \,\|\, P_{X_i}(x_i) P_{X_j}(x_j)\right]. \tag{26}$$

These divergences are measures of pairwise independence and should be zero whenever
independence holds.
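A histogram-based estimate of the average pairwise divergence can be sketched as follows. The bin count, the synthetic data, and the function names are assumptions chosen for illustration, and the averaging over the $d - 1$ remaining features is our reading of the definition; the sketch is not the code used to produce the results reported here.

```python
import numpy as np

def pairwise_kl(xi, xj, bins=16):
    """Histogram estimate of KL[P(xi, xj) || P(xi) P(xj)] in nats."""
    joint, _, _ = np.histogram2d(xi, xj, bins=bins)
    joint = joint / joint.sum()
    pi = joint.sum(axis=1, keepdims=True)      # marginal of xi
    pj = joint.sum(axis=0, keepdims=True)      # marginal of xj
    prod = pi * pj
    mask = joint > 0                           # empty cells contribute zero
    return float(np.sum(joint[mask] * np.log(joint[mask] / prod[mask])))

def average_pairwise_kl(X, bins=16):
    """For each feature i, average the pairwise divergence over all j != i."""
    d = X.shape[1]
    return np.array([
        sum(pairwise_kl(X[:, i], X[:, j], bins) for j in range(d) if j != i) / (d - 1)
        for i in range(d)
    ])

rng = np.random.default_rng(0)
S = rng.standard_normal((20000, 3))
X = S.copy()
X[:, 1] = 0.8 * S[:, 0] + 0.6 * S[:, 1]        # feature 1 depends on feature 0

kl_per_feature = average_pairwise_kl(X)
```

Features 0 and 1 show a clearly positive average divergence, while feature 2, which is independent of the others, stays near zero up to the small positive bias of histogram estimates.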

One popular way to measure dependencies of order larger than two is through
high-order statistics, such as cross-cumulants. While the 2nd-order cross-cumulant

$$Cum[X_i, X_j] = E[X_i X_j], \quad \forall i \neq j \tag{27}$$

is a measure of the correlation between two variables, the 4th-order cross-cumulant

$$\begin{aligned}
Cum[X_i, X_j, X_k, X_l] = {} & E[X_i X_j X_k X_l] - E[X_i X_j] E[X_k X_l] \\
& - E[X_i X_k] E[X_j X_l] - E[X_i X_l] E[X_j X_k], \quad \forall i \neq (j, k, l),
\end{aligned}$$

can serve both as a measure of 1) linear fourth-order dependence and 2) distance from
Gaussianity [12], and higher-order cumulants capture dependencies of higher order.
Unfortunately, the number of terms in a cumulant grows exponentially with its order
and the computations involved rapidly become infeasible. We computed cumulants up
to 6th order, but omit the formulas. All cumulant information was summarized by the
norm of the off-diagonal terms (cross-cumulants), e.g.

$$\|Cum_4\| = \sum_{i \neq (j,k,l)} Cum^2[X_i, X_j, X_k, X_l] \tag{28}$$

for the fourth-order cumulant. These statistics are zero when independence holds.

Figure 3: Independence measures on Brodatz. Left: curve of cumulative average KL
divergence ($\frac{1}{i} \sum_{j=1}^{i} KL(X_j)$). Right: cross-cumulant norm.
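The fourth-order cross-cumulant norm can be estimated from samples as sketched below. The sketch assumes zero-mean features and takes "off-diagonal" to mean every index tuple whose entries are not all equal; both the synthetic Laplacian sources and the function names are hypothetical. The example contrasts independent sources with an orthogonal mixture of them, which remains decorrelated but is no longer independent.

```python
import itertools
import numpy as np

def cum4(X, i, j, k, l):
    """Sample 4th-order cumulant of the zero-mean columns i, j, k, l of X."""
    cols = [X[:, a] for a in (i, j, k, l)]

    def moment(*idx):
        return float(np.mean(np.prod([cols[a] for a in idx], axis=0)))

    return (moment(0, 1, 2, 3)
            - moment(0, 1) * moment(2, 3)
            - moment(0, 2) * moment(1, 3)
            - moment(0, 3) * moment(1, 2))

def cross_cumulant_norm(X):
    """Sum of squared 4th-order cumulants over tuples with at least two distinct indices."""
    d = X.shape[1]
    return sum(
        cum4(X, *idx) ** 2
        for idx in itertools.product(range(d), repeat=4)
        if len(set(idx)) > 1               # skip auto-cumulants
    )

rng = np.random.default_rng(0)
S = rng.laplace(scale=1 / np.sqrt(2), size=(50000, 2))   # independent, unit variance
S -= S.mean(axis=0)
R = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)     # orthogonal 45-degree mixing
Y = S @ R.T                                              # decorrelated but dependent

norm_independent = cross_cumulant_norm(S)
norm_mixed = cross_cumulant_norm(Y)
```

The mixture `Y` has nearly zero second-order cross-cumulants, yet its fourth-order norm is large, which is the kind of higher-order dependence that decorrelation alone cannot remove.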

5 Results 

Figure 3 presents the independence measures obtained on Brodatz. The curves on the
left plot represent the cumulative average KL divergence (26) after reordering the $X_j$
such that $KL(X_{j+1}) \leq KL(X_j)$. These curves suggest the existence of two groups:
the first, consisting of the ICA techniques, achieves significantly better pairwise
independence than the second, consisting of the decorrelating transforms. A somewhat
different picture starts to emerge from the right plot which shows the evolution of the
cumulant norm as a function of its order. While the ICA techniques (together with
PCA) achieve the lowest cumulant norms, the slope of the curve (between 4th and
6th order) is larger than that of the wavelet features. This indicates that, for higher
orders, the curves are likely to cross, in which case the wavelet representation would
be the most independent. This observation is supported by the results that follow and
suggests that minimizing cumulants up to a certain order does not really provide any
independence guarantees, since the dependencies can simply become of higher-order.

Figure 4: Recognition results on Brodatz. Left: Precision, at 30% recall, achieved with
joint modeling. Right: Precision loss inherent to the independence assumption.

In order to evaluate recognition accuracy we measured precision at various levels of
recall$^2$. Since the results were qualitatively similar for all levels, we only present curves
of precision, as a function of subspace dimension, at 30% recall on Brodatz and 10%
recall on Corel. The left plot of Figure 4 shows the precision achieved on Brodatz with
joint modeling. The right plot presents the associated precision loss$^3$ when the joint
model is replaced by the product of the marginals. This precision loss is a measure of
the dependence between the features, since both models should lead to the same result
when independence holds.
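The retrieval measures used here can be computed from a ranked list of retrieved labels as sketched below. Recall is the fraction of all relevant images contained in the top $n$ results, and precision the fraction of the top $n$ that are relevant. The convention of reporting precision at the smallest $n$ that reaches the target recall is an assumption of this sketch, as are the function names and the toy ranking.

```python
def precision_recall(ranked_labels, query_label, n):
    """Precision and recall of the top-n results for a query of class query_label."""
    relevant_retrieved = sum(1 for lab in ranked_labels[:n] if lab == query_label)
    total_relevant = sum(1 for lab in ranked_labels if lab == query_label)
    return relevant_retrieved / n, relevant_retrieved / total_relevant

def precision_at_recall(ranked_labels, query_label, target_recall):
    """Precision at the smallest n whose recall reaches target_recall."""
    for n in range(1, len(ranked_labels) + 1):
        precision, recall = precision_recall(ranked_labels, query_label, n)
        if recall >= target_recall:
            return precision
    return 0.0

# Hypothetical ranking of a retrieval database, most similar image first.
RANKED = ["t1", "t1", "t2", "t1", "t3", "t1", "t2", "t1"]

p4, r4 = precision_recall(RANKED, "t1", 4)
p_at_30 = precision_at_recall(RANKED, "t1", 0.3)
p_at_100 = precision_at_recall(RANKED, "t1", 1.0)
```

Precision loss is then simply the difference between the value obtained with the joint model and the value obtained with the product of marginals at the same recall level.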

Two major conclusions can be drawn from the figure. First, the ordering of
transformations by degree of independence is quite surprising, with wavelets at the top,
followed by PCA, the two ICA methods, and the DCT (as a distant last). While we want
to avoid conclusions such as "feature transform X leads to weaker dependencies" that
may not generalize to other databases, it is clear that this ordering is very different from
that of Figure 3 (ICA techniques on top, then DCT and PCA, and finally wavelets). This
can only mean that quantities such as pairwise KL divergence or a limited set of
cross-cumulants do not really capture what is going on in terms of independence, at least
the aspects that are important for recognition. While this is not completely surprising,
since these measures only capture pairwise or linear dependencies, it clearly indicates
that recognition is affected by much more sophisticated patterns of dependence. The
logical conclusion is that ICA techniques designed to minimize measures such as those
of Figure 3 may not always be of great use for recognition.

$^2$ When the n most similar images to a query are retrieved, recall is the percentage of all relevant images
that are contained in that set, and precision the percentage of the n which are relevant.

$^3$ By precision loss we mean the difference between the precision achieved with the joint and independent
models.

Figure 5: Recognition results on Corel. Left: Precision, at 10% recall, achieved with
joint modeling. Right: Precision loss inherent to the independence assumption.

Second, the precision curves seem to comply very well with the theoretical argu- 
ments of section 2. In particular, they are concave (there is a large increase in precision 
from 1 to 8 dimensions that we do not show for clarity of the graph), and tend to be 
flatter when the features are more independent. Remember that compliance with the 
theory implies that the curves are dominated by the Bayes error for all dimensions 
when the features are independent, and up to the inflection point when they are not. 
This is an important observation, since the more independent features (flatter curves) 
have smaller precision than that achieved at the inflection point of the less independent 
ones. In fact, a comparison of the two plots reveals significant evidence in support of 
the conjecture that precision at the inflection point is a monotonic function of the de- 
gree of dependence of the features! The natural conclusion is then that independence 
has a non-negligible cost in terms of Bayes error. In particular, the precision achieved 
with the most independent features (wavelet coefficients) is almost 10% below the
peak precision achieved with the less independent ones (DCT).

This conclusion is also supported by Figure 5, which presents recognition results
on Corel. Since this is a larger database and contains color images (leading to a
192-dimensional feature space), the queries take significantly longer to compute. For this
reason, we restricted the analysis to the first 64 dimensions (and only considered one of
the ICA techniques), which are probably not enough to reach the inflection point in all cases.
Nevertheless, one can still confidently say that the precision of the more independent 
feature transforms is roughly 10% lower than the peak precision of the less independent 
transforms. The only significant difference with respect to the results obtained on Bro- 
datz is that ICA does appear to produce features which are very close to independent, 
while the wavelet coefficients are not independent. 
A Proof of Theorem 4

Proof: The fact that the sequence of vector spaces is embedded follows from (16) since,
$\forall i \in \{1, \ldots, d-1\}$,

$$\mathcal{X}_i = \pi_i^{i+1}(\mathcal{X}_{i+1}) \tag{29}$$

and, consequently, 

c (30) 

Inequality (17) then follows from (29), (3) and the fact that the mappings $\pi_i^{i+1}(\mathbf{x})$ are
non-invertible.

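To prove (20) we start from Corollary 1, i.e.

$$\Delta_{g_i,\mathcal{X}_i} = \sum_k KL\left[P_{\mathbf{X}_i|Y}(\mathbf{x}|k) \,\|\, \hat{p}_{\mathbf{X}_i|Y}(\mathbf{x}|k)\right], \tag{31}$$

where $P_{\mathbf{X}_i|Y}(\mathbf{x}|k)$ is the class-conditional likelihood function for $\mathbf{X}_i$ under class $k$.
Since, from (29), $\mathbf{X}_{i+1} = (\mathbf{X}_i, X_{i+1})$ where $X_{i+1}$ is the $(i+1)$th coordinate of $\mathbf{X}_{i+1}$,

$$\begin{aligned}
KL&\left[P_{\mathbf{X}_{i+1}|Y}(\mathbf{x}|k) \,\|\, \hat{p}_{\mathbf{X}_{i+1}|Y}(\mathbf{x}|k)\right] \\
&= \int P_{\mathbf{X}_{i+1}|Y}(\mathbf{x}|k) \log \frac{P_{X_{i+1}|\mathbf{X}_i,Y}(x_{i+1}|\pi_i^{i+1}(\mathbf{x}),k)}{\hat{p}_{X_{i+1}|\mathbf{X}_i,Y}(x_{i+1}|\pi_i^{i+1}(\mathbf{x}),k)}\, d\mathbf{x}
 + \int P_{\mathbf{X}_{i+1}|Y}(\mathbf{x}|k) \log \frac{P_{\mathbf{X}_i|Y}(\pi_i^{i+1}(\mathbf{x})|k)}{\hat{p}_{\mathbf{X}_i|Y}(\pi_i^{i+1}(\mathbf{x})|k)}\, d\mathbf{x} \\
&= \int P_{\mathbf{X}_{i+1}|Y}(\mathbf{x}|k) \log \frac{P_{\mathbf{X}_{i+1}|Y}(\mathbf{x}|k)}{\hat{p}_{X_{i+1}|\mathbf{X}_i,Y}(x_{i+1}|\pi_i^{i+1}(\mathbf{x}),k)\, P_{\mathbf{X}_i|Y}(\pi_i^{i+1}(\mathbf{x})|k)}\, d\mathbf{x}
 + \int P_{\mathbf{X}_{i+1}|Y}(\mathbf{x}|k) \log \frac{P_{\mathbf{X}_i|Y}(\pi_i^{i+1}(\mathbf{x})|k)}{\hat{p}_{\mathbf{X}_i|Y}(\pi_i^{i+1}(\mathbf{x})|k)}\, d\mathbf{x} \\
&= KL\left[P_{\mathbf{X}_{i+1}|Y}(\mathbf{x}|k) \,\|\, \hat{p}_{X_{i+1}|\mathbf{X}_i,Y}(x_{i+1}|\pi_i^{i+1}(\mathbf{x}),k)\, P_{\mathbf{X}_i|Y}(\pi_i^{i+1}(\mathbf{x})|k)\right]
 + KL\left[P_{\mathbf{X}_i|Y}(\mathbf{x}|k) \,\|\, \hat{p}_{\mathbf{X}_i|Y}(\mathbf{x}|k)\right] \\
&\geq KL\left[P_{\mathbf{X}_i|Y}(\mathbf{x}|k) \,\|\, \hat{p}_{\mathbf{X}_i|Y}(\mathbf{x}|k)\right]
\end{aligned}$$

where we have used the non-negativity of the KL divergence [6]. Combining with (31)
leads to (20).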